The first part of this notebook is based on Karthik Ram’s ggplot2 lecture (CC BY 2.0).
GOALS: Students should be able to use ggplot2 to generate publication-quality graphics, and to understand and use the basics of the grammar of graphics.
ggplot2 is built on the grammar of graphics:
ggplot2 thinks about a figure in layers, much like layers in ArcGIS or in programs like Photoshop.
Layers are drawn by geoms such as geom_point(), geom_bar(), geom_density(), geom_line(), geom_area().
RStudio gives us the concept of a project, and it is good practice to create one whenever you start a new piece of work.
File > New project
New directory > R project
Name the directory data-viz and click OK
Create a new folder called data
We are ready to start.
We can load data into R via multiple mechanisms:
Built-in datasets are listed with data(). More datasets will show after package installation and loading.
data()
library(openintro)
Please visit openintro.org for free statistics materials
Attaching package: 'openintro'
The following objects are masked from 'package:MASS':
housing, mammals
The following object is masked from 'package:ggplot2':
diamonds
The following objects are masked from 'package:datasets':
cars, trees
data()
Load a specific dataset by name with data():
data(diamonds)
head(diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#View(diamonds)
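To save a data frame as a csv you can write it out using the write_csv function:
library(readr) # provides write_csv() and read_csv()
write_csv(diamonds, 'data/diamonds.csv')
You can download a csv using download.file(), giving it the URL and the path and file name you want to save as.
#download.file(url, path/filename)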
download.file("https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv", 'data/gapminder-FiveYearData.csv')
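Re-running the notebook repeats the download every time. As a minimal sketch, the call can be wrapped so that it only runs when the file is missing. The helper name fetch_if_missing is hypothetical and not part of the original lesson; the usage line reuses the URL and path from above.

```r
# Hypothetical helper (an assumption, not from the original notebook):
# download a file only when it is not already on disk.
fetch_if_missing <- function(url, dest) {
  # Create the target folder (for example "data") if it does not exist yet
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)  # return the path so the call can be chained into a reader
}

# Usage (commented out because it needs network access):
# fetch_if_missing(
#   "https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv",
#   "data/gapminder-FiveYearData.csv"
# )
```

Returning the path invisibly keeps the console quiet while still allowing something like read_csv(fetch_if_missing(url, dest)).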
The read functions (such as read_csv) read data in as a data frame:
read_csv('data/gapminder-FiveYearData.csv')
Parsed with column specification:
cols(
country = col_character(),
year = col_integer(),
pop = col_double(),
continent = col_character(),
lifeExp = col_double(),
gdpPercap = col_double()
)
# A tibble: 1,704 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.8 779
2 Afghanistan 1957 9240934 Asia 30.3 821
3 Afghanistan 1962 10267083 Asia 32.0 853
4 Afghanistan 1967 11537966 Asia 34.0 836
5 Afghanistan 1972 13079460 Asia 36.1 740
6 Afghanistan 1977 14880372 Asia 38.4 786
7 Afghanistan 1982 12881816 Asia 39.9 978
8 Afghanistan 1987 13867957 Asia 40.8 852
9 Afghanistan 1992 16317921 Asia 41.7 649
10 Afghanistan 1997 22227415 Asia 41.8 635
# ... with 1,694 more rows
gapminder <- read_csv('data/gapminder-FiveYearData.csv')
Parsed with column specification:
cols(
country = col_character(),
year = col_integer(),
pop = col_double(),
continent = col_character(),
lifeExp = col_double(),
gdpPercap = col_double()
)
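The "Parsed with column specification" message shows the column types that read_csv guessed. One possible way to make those types explicit, and to silence the message, is to pass the same specification through col_types. The sketch below uses a temporary one-row file in the gapminder layout (values rounded as in the printout above) so that it runs without the downloaded data; gap_spec and gap_small are illustrative names.

```r
library(readr)

# A one-row stand-in for the gapminder file, written to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop,continent,lifeExp,gdpPercap",
             "Afghanistan,1952,8425333,Asia,28.8,779"), tmp)

# The same specification that read_csv reported, now stated explicitly
gap_spec <- cols(
  country   = col_character(),
  year      = col_integer(),
  pop       = col_double(),
  continent = col_character(),
  lifeExp   = col_double(),
  gdpPercap = col_double()
)

gap_small <- read_csv(tmp, col_types = gap_spec)
str(gap_small)
```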
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
library(ggplot2)
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
Basic structure
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
myplot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
myplot + geom_point()
The data and variables are specified inside the ggplot function.
Increase size of points
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(size = 3)
Make it colorful
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(size = 3)
Differentiate points by shape
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(aes(shape = Species), size = 3)
# Make a small sample of the diamonds dataset
d2 <- diamonds[sample(1:dim(diamonds)[1], 1000), ]
Then generate this plot below. (open 09-plot-ggplot2-ex-1-1.png)
ggplot(d2, aes(carat, price, color = color)) + geom_point() + theme_gray()
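Because sample() draws at random, d2 contains different rows on every run, so the exercise plot will not look exactly the same twice. A small sketch of a reproducible version, assuming an arbitrary seed value of 42 chosen only for illustration:

```r
library(ggplot2)

set.seed(42)  # arbitrary seed (an assumption); any fixed value works
# nrow() is a shorter way to write dim(diamonds)[1]
d2 <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]

# Re-seeding and sampling again yields the same rows
set.seed(42)
d2_again <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]
identical(d2, d2_again)
```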
Switch to Gapminder Data
library(readr) #from tidyverse
#download.file("https://goo.gl/BtBnPg", 'data/gapminder-FiveYearData.csv')
#gapminder <- read.csv("https://goo.gl/BtBnPg")
gapminder <- read_csv('data/gapminder-FiveYearData.csv')
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
NOTE:
First we call the ggplot function: any arguments we provide to the ggplot function are considered global options, so they apply to all layers.
We passed two arguments to ggplot: the data, and a call to the aes function, which tells ggplot how variables map to aesthetic properties, here the x & y locations.
On its own, the ggplot call isn’t enough to draw the data.
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap))
## If run, this draws only an empty panel with axes (very old ggplot2 versions raised an error instead).
Need to tell ggplot how we want to present variables by specifying a geom layer. In the above example we used geom_point to create a scatter plot.
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point()
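One way to see the difference between the two calls is to inspect the plot objects themselves. The sketch below uses the built-in iris data so that it runs without the gapminder file; base and with_points are illustrative names.

```r
library(ggplot2)

# Data and mappings only: no geom layer has been added yet
base <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
length(base$layers)         # number of layers before adding a geom

# Adding a geom appends one layer to the plot object
with_points <- base + geom_point()
length(with_points$layers)  # number of layers after adding geom_point()
```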
See ?geom_boxplot for a list of options
library(MASS)
head(birthwt)
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
87 0 20 105 1 1 0 0 0 1 2557
88 0 21 108 1 1 0 0 1 2 2594
89 0 18 107 1 1 0 0 1 0 2600
91 0 21 124 3 0 0 0 0 0 2622
ggplot(birthwt, aes(factor(race), bwt)) + geom_boxplot()
ggplot(birthwt, aes(factor(race), bwt)) + geom_boxplot() +
scale_x_discrete(labels = c("white", "black", "other"))
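Relabelling through scale_x_discrete relies on the labels being listed in the same order as the factor levels. A possible alternative, sketched here, is to attach the labels to the data so the pairing is explicit. The coding 1 = white, 2 = black, 3 = other follows the MASS documentation for birthwt; the column name race_label is made up for this example.

```r
library(MASS)
library(ggplot2)

# Pair each numeric code with its label explicitly
birthwt$race_label <- factor(birthwt$race,
                             levels = c(1, 2, 3),
                             labels = c("white", "black", "other"))

ggplot(birthwt, aes(race_label, bwt)) + geom_boxplot()
```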
See ?geom_histogram for a list of options
h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 30, colour = "black")
h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 8, fill = "steelblue", colour = "black")
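The two histograms differ only in binwidth. To check what that does to the underlying bins, the computed data can be pulled out with layer_data(); this is a sketch for inspection, not something the plots themselves need.

```r
library(ggplot2)

h <- ggplot(faithful, aes(x = waiting))

# layer_data() returns the bins that geom_histogram computed
wide   <- layer_data(h + geom_histogram(binwidth = 30))
narrow <- layer_data(h + geom_histogram(binwidth = 8))

nrow(wide)         # few, very coarse bins
nrow(narrow)       # more bins, finer detail
sum(narrow$count)  # every observation falls in exactly one bin
```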
Download data
#download.file('https://raw.github.com/karthikram/ggplot-lecture/master/climate.csv', 'data/climate.csv')
#climate <- read.csv(text = RCurl::getURL("https://raw.github.com/karthikram/ggplot-lecture/master/climate.csv"))
Anomaly10y is a 10-year running average of the deviation (in Celsius) from the average 1950–1980 temperature.
climate <- read_csv("data/climate.csv")
Warning: Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
X1 = col_integer(),
Source = col_character(),
Year = col_integer(),
Anomaly1y = col_character(),
Anomaly5y = col_character(),
Anomaly10y = col_double(),
Unc10y = col_double()
)
ggplot(climate, aes(Year, Anomaly10y)) +
geom_line()
We can also plot confidence regions
Anomaly10y is a 10-year running average of the deviation (in Celsius) from the average 1950–1980 temperature, and Unc10y is the 95% confidence interval.
A confidence interval gives a range of plausible values for a parameter. It depends on a specified confidence level, with higher confidence levels corresponding to wider confidence intervals and lower confidence levels corresponding to narrower confidence intervals. Common confidence levels include 90%, 95%, and 99%.
Usually we don’t just begin chapters with a definition, but confidence intervals are simple to define and play an important role in the sciences and any field that uses data. You can think of a confidence interval as playing the role of a net when fishing. Instead of just trying to catch a fish with a single spear (estimating an unknown parameter by using a single point estimate/statistic), we can use a net to try to provide a range of possible locations for the fish (use a range of possible values based around our statistic to make a plausible guess as to the location of the parameter).
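To make the link between confidence level and interval width concrete, here is a small sketch that computes t-based intervals for the mean of faithful$waiting. The helper mean_ci is hypothetical, and a t-based interval for a mean is an assumption made for illustration; it is not necessarily how Unc10y was derived.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(x, level = 0.95) {
  n    <- length(x)
  se   <- sd(x) / sqrt(n)                       # standard error of the mean
  crit <- qt(1 - (1 - level) / 2, df = n - 1)   # two-sided critical value
  mean(x) + c(lower = -1, upper = 1) * crit * se
}

ci90 <- mean_ci(faithful$waiting, level = 0.90)
ci95 <- mean_ci(faithful$waiting, level = 0.95)
ci99 <- mean_ci(faithful$waiting, level = 0.99)

# Higher confidence level, wider interval
c(width90 = unname(diff(ci90)),
  width95 = unname(diff(ci95)),
  width99 = unname(diff(ci99)))
```

The 95% result can be compared against t.test(faithful$waiting)$conf.int, which uses the same formula.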
We’ll set ymax and ymin to Anomaly10y plus or minus Unc10y (Figure 4-25):
ggplot(climate, aes(Year, Anomaly10y)) +
geom_ribbon(aes(ymin = Anomaly10y - Unc10y, ymax = Anomaly10y + Unc10y),
fill = "blue", alpha = .1) +
geom_line(color = "steelblue")
Modify the previous plot and change it such that there are three lines instead of one with a confidence band.
cplot <- ggplot(climate, aes(Year, Anomaly10y))
cplot <- cplot + geom_line(size = 0.7, color = "black")
cplot <- cplot + geom_line(aes(Year, Anomaly10y + Unc10y), linetype = "dashed", size = 0.7, color = "red")
cplot <- cplot + geom_line(aes(Year, Anomaly10y - Unc10y), linetype = "dashed", size = 0.7, color = "red")
cplot + theme_gray()
#theme_classic
#theme_bw()
#theme_minimal()
A scatter plot is not the best way to visualize change over time. Let’s use a line plot instead.
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) +
geom_line()
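Here we added a by aesthetic to get a line per country, and colored the lines by continent. (The documented ggplot2 aesthetic for this kind of grouping is group.)
But what if we want to visualize both lines and points on the plot?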
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) +
geom_line() + geom_point()
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country)) +
geom_line(aes(color=continent)) + geom_point()
Here the color aesthetic has been moved from the global options in ggplot to the geom_line layer, so it no longer applies to the points.
ggplot(iris, aes(Species, Sepal.Length)) +
geom_bar(stat = "identity")
library(tidyr)
#df <- melt(iris, id.vars = "Species")
df <- gather(iris, variable, value, -Species )
ggplot(df, aes(Species, value, fill = variable)) +
geom_bar(stat = "identity")
The heights of the bars commonly represent one of two things: either a count of cases in each group, or the values in a column of the data frame. By default, geom_bar uses stat = "count" (called stat = "bin" in ggplot2 versions before 2.0). This makes the height of each bar equal to the number of cases in each group, and it is incompatible with mapping values to the y aesthetic. If you want the heights of the bars to represent values in the data, use stat = "identity" and map a value to the y aesthetic.
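The difference between counting rows and using values can be checked on a tiny data frame. The data below is a toy example invented for this sketch; layer_data() is used only to show the numbers behind the bars.

```r
library(ggplot2)

# Toy data (an assumption): group "a" has two rows, group "b" has one
toy <- data.frame(group = c("a", "a", "b"), value = c(2, 3, 10))

# Default stat: bar height is the number of rows in each group
counts <- layer_data(ggplot(toy, aes(group)) + geom_bar())
counts$count

# stat = "identity": bar height comes from the value column,
# and rows in the same group are stacked on top of each other
values <- layer_data(ggplot(toy, aes(group, value)) +
                       geom_bar(stat = "identity"))
values[, c("x", "ymin", "ymax")]
```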
These two packages are the Swiss army knives of R.
dplyr
* filter
* select
* mutate
tidyr
* gather
* spread
* separate
Let’s look at iris again.
iris[1:2, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
df <- gather(iris, variable, value, -Species )
df[1:2, ]
Species variable value
1 setosa Sepal.Length 5.1
2 setosa Sepal.Length 4.9
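Newer versions of tidyr (1.0.0 and later) also offer pivot_longer for the same wide-to-long reshaping. The sketch below is a possible equivalent of the gather call above, assuming such a version is installed; df_long is an illustrative name, and the row order differs from gather's output.

```r
library(tidyr)

# Reshape every column except Species into variable/value pairs
df_long <- pivot_longer(iris,
                        cols      = -Species,
                        names_to  = "variable",
                        values_to = "value")

head(df_long, 2)
nrow(df_long)  # 150 flowers x 4 measurements
```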
ggplot(df, aes(Species, value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge")
Using the d2 dataset you created earlier, generate this plot below. Take a quick look at the data first to see if it needs to be binned
d2 <- diamonds[sample(1:dim(diamonds)[1], 1000), ]
ggplot(d2, aes(clarity, fill = cut)) +
geom_bar(position = "dodge")
Use the ifelse function to create clim$sign:
clim <- read.csv('data/climate.csv', header = TRUE)
clim$sign <- ifelse(clim$Anomaly10y < 0, FALSE, TRUE)
# or as simple as
# clim$sign <- clim$Anomaly10y >= 0
ggplot(clim, aes(Year, Anomaly10y)) + geom_bar(stat = "identity", aes(fill = sign)) + theme_gray()
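The ifelse call marks negative anomalies as FALSE and everything else as TRUE, so the equivalent direct comparison is >= 0. A quick check on a toy vector (values invented for illustration) shows the equivalence:

```r
# Toy anomalies: one negative, one exactly zero, one positive
anomaly <- c(-0.2, 0, 0.3)

via_ifelse <- ifelse(anomaly < 0, FALSE, TRUE)
via_compare <- anomaly >= 0

via_ifelse
identical(via_ifelse, via_compare)
```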
ggplot(faithful, aes(waiting)) + geom_density()
ggplot(faithful, aes(waiting)) +
geom_density(fill = "blue", alpha = 0.1)
ggplot(faithful, aes(waiting)) +
geom_line(stat = "density")
aes(color = variable) # map color to a variable in the data
color = "black" # set a fixed color: pass it to the geom, outside aes()
# Or add it as a scale
scale_fill_manual(values = c("color1", "color2"))
library(RColorBrewer)
display.brewer.all()
#df <- melt(iris, id.vars = "Species")
ggplot(df, aes(Species, value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set1")
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() +
facet_grid(Species ~ .) +
scale_color_manual(values = c("red", "green", "blue"))
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap, color=continent)) +
geom_point()
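We can change the scale of units on the y axis using the scale functions.
We can also modify the transparency of the points using the alpha setting, which is helpful when you have a large amount of data which is very clustered.
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +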
geom_point() + scale_y_log10()
The log10 function applied a transformation to the values of the gdpPercap column before rendering them on the plot. This makes it easier to visualize the spread of data on the y-axis.
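To confirm that the scale transforms the values rather than just relabelling the axis, the data that ggplot2 actually draws can be inspected with layer_data(). The data frame below is a toy example invented for this sketch.

```r
library(ggplot2)

# Toy data (an assumption): y values one order of magnitude apart
toy <- data.frame(x = 1:3, y = c(10, 100, 1000))

p <- ggplot(toy, aes(x, y)) + geom_point() + scale_y_log10()

# The y positions passed to the geom are log10 of the original values
layer_data(p)$y
```

Each step of one unit on the transformed axis therefore corresponds to a factor of 10 in the original data.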
We can fit a simple relationship to the data by adding another layer, geom_smooth:
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point() + scale_y_log10() + geom_smooth(method="lm")
We can make the line thicker by setting the size in the geom_smooth layer:
pwd <- ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point() + scale_y_log10() + geom_smooth(method="lm", size=1.5)
Here we set the size by passing it as an argument to geom_smooth.
Previously we used the aes function to define a mapping between data variables and their visual representation.
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(aes(shape = Species), size = 3) +
geom_smooth(method = "lm")
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(aes(shape = Species), size = 3) +
geom_smooth(method = "lm") +
facet_grid(. ~ Species)
starts.with <- substr(gapminder$country, start = 1, stop = 1)
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country)
The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~).
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country) +
xlab("Year") + ylab("Life expectancy") + ggtitle("Figure 1") +
scale_colour_discrete(name="Continent") +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
http://swcarpentry.github.io/r-novice-gapminder/08-plot-ggplot2#challenge-5
#str(iris)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() +
facet_grid(Species ~ .)
### And along rows (panels side by side, one column per species)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() +
facet_grid(. ~ Species)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() +
facet_wrap( ~ Species)
Themes are a great way to define custom plots.
+theme()
See ?theme() for more options
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(size = 1.2, shape = 16) +
facet_wrap( ~ Species) +
theme(legend.key = element_rect(fill = NA),
legend.position = "bottom",
strip.background = element_rect(fill = NA),
axis.title.y = element_text(angle = 0))
#install.packages("ggthemes")
library(ggthemes)
+ theme_stata()
+ theme_excel()
+ theme_wsj()
+ theme_solarized()
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(size = 1.2, shape = 16) +
facet_wrap( ~ Species) +
theme_solarized() +
theme(legend.key = element_rect(fill = NA),
legend.position = "bottom",
strip.background = element_rect(fill = NA),
axis.title.y = element_text(angle = 0))
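# Save the last plot that was displayed
ggsave('~/path/to/figure/filename.png')
# Save a specific plot object; the file name comes first, the plot second
ggsave("~/path/to/figure/filename.png", plot = plot1)
# Set the size (in inches by default)
ggsave(filename = "/path/to/figure/filename.png", width = 6, height = 4)
# The file extension determines the output format
ggsave(filename = "/path/to/figure/filename.eps")
ggsave(filename = "/path/to/figure/filename.jpg")
ggsave(filename = "/path/to/figure/filename.pdf")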
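The paths above are placeholders. As a runnable sketch, the example below saves a plot into R's temporary directory so it does not depend on any particular folder layout; plot1, out_file, and the file name iris-scatter.pdf are illustrative choices.

```r
library(ggplot2)

# Build a plot and keep it in a variable so it can be saved explicitly
plot1 <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# Write to a temporary location; the .pdf extension selects the PDF device
out_file <- file.path(tempdir(), "iris-scatter.pdf")
ggsave(filename = out_file, plot = plot1, width = 6, height = 4)

file.exists(out_file)
```

Passing plot explicitly avoids accidentally saving whichever plot happened to be displayed last.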
This is just a taste of what you can do with ggplot2. RStudio provides a really useful cheat sheet of the different layers available, and more extensive documentation is available on the ggplot2 website. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!